Proportions Correct When Population Ability Distributions are Incongruent

Abstract



A popular method of analyzing test items for differential item functioning (DIF) is 
to compute a statistic that conditions samples of examinees from different populations on 
an estimate of ability. This conditioning or matching by ability is intended to produce an 
appropriate statistic that is sensitive to true differences in item functioning, provided the 
ability estimate accurately reflects a comparable level of the true ability for these 
populations. If the observed or number-correct score is used as a conditioning or 
grouping variable, a problem exists whenever examinees from two different populations 
are matched on the same level of the observed test score, but actually have quite 
different levels of the unobserved ability. This occurs whenever the distributions of true 
abilities for the populations of interest are incongruent or non-overlapping. This 
situation was investigated in a series of computer simulations. The results indicated that 
the magnitude of the problem, in terms of being able to detect true DIF with moderate 
sample sizes when ability distributions are incongruent, may not be that serious for tests 
which are, on average, free from DIF. 

Two statistics that are used to indicate differential item functioning (DIF)
between two populations of examinees are the Mantel-Haenszel common-odds ratio
(MH) (or equivalently the Mantel-Haenszel negative log-odds ratio) (Holland & Thayer,
1986) and the standardized difference (STD) in proportions correct (Dorans & Kulick,
1986). Both statistics condition on some ability measure, usually the observed score of
the test containing the items undergoing the DIF analysis. Conditioning on the
observed test score in order to evaluate population differences in item proportion correct
would appear to be appropriate provided the matching observed test score accurately
reflects a comparable level of the measured trait for the populations of interest.
However, problems arise whenever identical values of the observed test score, X,
represent different levels of ability across groups. This can occur when the conditional
distributions of ability given observed score are different for the comparison groups used
in the DIF analysis.

Zwick (1990) has discussed the implications of this problem within a theoretical 
context. The purpose of the current paper is to present a more applied analysis of this 
problem and to attempt to determine how severe the situation must be before a DIF 
analysis that employs the MH or STD statistic leads to erroneous conclusions. 




Definitions of the DIF Statistics

The definitions of the estimator of the standardized difference in proportions 
correct (STD) and the Mantel-Haenszel common-odds ratio estimator (MH) are given as 
follows. 

If the two populations of examinees are labeled as a focal group (F) and a base
group (B), and s indexes each observed score category of a k-item test, or s = 0, 1, ..., k,
then

$N_{Fs}$ = the number of examinees in the F group at score s,
$N_{Bs}$ = the number of examinees in the B group at score s,
$N_s$ = the number of examinees in F and B at score s,

$G_{Fs} = N_{Fs} / \sum_{s=0}^{k} N_{Fs}$, the relative frequency of F at s,

$G_{Bs} = N_{Bs} / \sum_{s=0}^{k} N_{Bs}$, the relative frequency of B at s, and

$G_s = N_s / \sum_{s=0}^{k} N_s$, the total relative frequency of F and B at s.

If $R_{Fs}$ and $R_{Bs}$ are the numbers of examinees (i.e., absolute frequencies) in F and B,
respectively, at s who answer the item of interest correctly, then the proportion-correct
values for each group at s are given by $P_{Fs} = R_{Fs} / N_{Fs}$ and $P_{Bs} = R_{Bs} / N_{Bs}$.

The STD Statistic 

The standardized difference in proportions correct is defined as

$$\mathrm{STD} = \sum_{s=0}^{k} (P_{Fs} - P_{Bs})\, G_{Fs} , \tag{1}$$

where the signed difference, $P_{Fs} - P_{Bs}$, is weighted by the relative frequency of F. The
statistic is defined on the proportion-correct scale and indicates, on average, how
members of F differed from comparable members of B. Negative values of STD indicate
that an item favors B, while positive values indicate that an item favors F. Values of the
STD statistic near zero indicate no DIF.
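
As an illustration only, Equation 1 can be sketched in Python. The function name, the list-based layout (one entry per score category s = 0, ..., k), and the decision to skip score categories that are empty in either group are assumptions made for this example; the paper does not state how such categories were handled.

```python
def std_statistic(n_focal, r_focal, n_base, r_base):
    """Standardized difference in proportions correct (Equation 1).

    n_focal, n_base: examinee counts per score category for F and B.
    r_focal, r_base: counts answering the item correctly per category.
    """
    total_focal = sum(n_focal)
    std = 0.0
    for nf, rf, nb, rb in zip(n_focal, r_focal, n_base, r_base):
        # Assumption: a category that is empty in either group is skipped,
        # because one of the two proportions correct is undefined there.
        if nf == 0 or nb == 0:
            continue
        weight = nf / total_focal              # relative frequency of F at s
        std += (rf / nf - rb / nb) * weight    # weighted signed difference
    return std


# Hypothetical three-category example: the item slightly favors B.
example_std = std_statistic([10, 20, 10], [2, 10, 8], [10, 10, 20], [3, 5, 18])
```

A negative value of `example_std` corresponds to an item that favors B, in line with the sign convention described above.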

The MH Statistic

If $W_{Fs}$ and $W_{Bs}$ are the absolute frequencies of incorrect responses to this item in
F and B, respectively, at s, and $N_s$ is the total number of responses at s, then the
Mantel-Haenszel common-odds ratio estimator is

$$\mathrm{MH} = \frac{\sum_{s=0}^{k} R_{Bs} W_{Fs} / N_s}{\sum_{s=0}^{k} R_{Fs} W_{Bs} / N_s} . \tag{2}$$

If $Q_{Fs}$ and $Q_{Bs}$ are defined as $(1 - P_{Fs})$ and $(1 - P_{Bs})$, respectively, then this
statistic could also be written as

$$\mathrm{MH} = \frac{\sum_{s=0}^{k} P_{Bs} Q_{Fs}\, (G_{Fs} G_{Bs} / G_s)}{\sum_{s=0}^{k} P_{Fs} Q_{Bs}\, (G_{Fs} G_{Bs} / G_s)} . \tag{3}$$

The MH statistic can be interpreted as an estimate of the common odds-ratio. It
indicates, on average, how much more (or less) likely it is that a member of B answered
the item correctly than did a comparable member of F. The MH statistic has a value at
or near 1.0 if there is no DIF between B and F. If the item favors B, MH is greater than
1.0; if the item favors F, MH is less than 1.0. Frequently, the odds ratio is transformed
by using some function of the negative of the natural log of the ratio.
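
A minimal sketch of Equation 2 and of the negative log transformation follows. The function names and the list-based layout are hypothetical, and the handling of empty score categories (they are skipped) is an assumption rather than a documented feature of the study.

```python
import math


def mh_odds_ratio(r_focal, w_focal, r_base, w_base):
    """Mantel-Haenszel common-odds ratio estimator (Equation 2).

    r_*: correct-response counts per score category.
    w_*: incorrect-response counts per score category.
    """
    numerator = 0.0
    denominator = 0.0
    for rf, wf, rb, wb in zip(r_focal, w_focal, r_base, w_base):
        n_s = rf + wf + rb + wb       # total responses at score s
        if n_s == 0:
            continue                  # assumption: empty categories add nothing
        numerator += rb * wf / n_s
        denominator += rf * wb / n_s
    # Raises ZeroDivisionError if no category has F-correct and B-incorrect cases.
    return numerator / denominator


def neg_log_mh(r_focal, w_focal, r_base, w_base):
    """Negative natural log of MH; negative values favor B, like STD."""
    return -math.log(mh_odds_ratio(r_focal, w_focal, r_base, w_base))


# Hypothetical three-category example in which the item favors B.
example_mh = mh_odds_ratio([2, 10, 8], [8, 10, 2], [3, 5, 18], [7, 5, 2])
```

On this scale, `example_mh` exceeds 1.0 and its negative log is below zero, so both transformed MH and STD carry the same sign when an item favors B.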
Population DIF Indices 

The DIF statistics given by Equations 1 and 2 are defined in terms of the 
observed test score. As mentioned previously, the examinees from each group ideally 
should be matched on their latent abilities or true scores. For computer simulation 
work, it is possible to define some measure of population DIF in terms of the latent 
ability or true score. This value then becomes the parameter of interest in estimation 
because it represents the value of the indices when true ability matching or conditioning 
has occurred. The DIF statistics can be compared to these population DIF indices, 
which then serve as a reference for valid DIF identification. 

The usual assumption concerning the latent ability or true score can be made,
namely that the latent ability, $\theta$, is a continuous random variable with known density
functions. If these arbitrary density functions of $\theta$ are denoted by $g_F(\theta)$ and $g_B(\theta)$, then
the combined group density can be represented by

$$g(\theta) = \alpha\, g_F(\theta) + (1 - \alpha)\, g_B(\theta) ,$$

where a mixing proportion, $\alpha$, is defined as $0 < \alpha < 1$. The mixing proportion is usually
taken to equal the relative proportion of examinees who appear in F (either sampled or
in the F population).

The definitions of each population DIF index are facilitated by replacing the
proportions correct and incorrect at each score category (i.e., $P_{Fs}$, $Q_{Fs}$, $P_{Bs}$, and $Q_{Bs}$) with
probability functions of the latent ability variable, $\theta$. In the context of the present paper,
it was assumed that the success probabilities, $P_F(\theta)$ and $P_B(\theta)$, were given by the
unidimensional three-parameter logistic item response function with known item
parameters for each group and for each item, or in general by

$$P(\theta) = c + \frac{1 - c}{1 + \exp[-a(\theta - b)]} . \tag{4}$$
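
The three-parameter logistic function of Equation 4 can be sketched as below. The `scale` argument is a hypothetical addition: some presentations of the model multiply the exponent by a constant such as 1.7, and whether the original analysis did so is not recoverable from the text, so the default here is 1.0.

```python
import math


def p_3pl(theta, a, b, c, scale=1.0):
    """Success probability under the three-parameter logistic model.

    a: discrimination, b: difficulty, c: lower asymptote.
    scale: optional multiplier on the exponent (assumption, default 1.0).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


# At theta == b the probability is halfway between c and 1.
halfway = p_3pl(0.0, 1.0, 0.0, 0.2)
```

For very low abilities the probability approaches the lower asymptote c rather than zero, which is the property that later limits how far the focal distribution can be shifted without affecting reliability.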

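A population value of STD, $\mu_{STD}$, was defined as the expected difference between
the proportions correct, relative to (or weighted by) $g_F(\theta)$ as the standardizing
distribution (Kendall, Stuart, & Ord, 1987, p. 46), or

$$\mu_{STD} = \int_{-\infty}^{\infty} [P_F(\theta) - P_B(\theta)]\, g_F(\theta)\, d\theta . \tag{5}$$

The population value of the common-odds ratio, $\psi$, was defined to be the
latent-variable equivalent of Equation 3, or

$$\psi = \frac{\int_{-\infty}^{\infty} P_B(\theta)\, Q_F(\theta)\, \dfrac{g_F(\theta)\, g_B(\theta)}{g(\theta)}\, d\theta}{\int_{-\infty}^{\infty} P_F(\theta)\, Q_B(\theta)\, \dfrac{g_F(\theta)\, g_B(\theta)}{g(\theta)}\, d\theta} , \tag{6}$$

where $Q_F(\theta) = 1 - P_F(\theta)$ and $Q_B(\theta) = 1 - P_B(\theta)$.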



Defining Equation 6 as the population value of the common-odds ratio is not
without some interpretative difficulties. Greenland (1982) pointed out that, although
there are several interpretations of an odds ratio when the ratio is not assumed to be
homogeneous in the population (i.e., the odds ratio is not constant across different values
of $\theta$), the weights, $(G_{Fs} G_{Bs}) / G_s$, used in the Mantel-Haenszel estimator have no logical
interpretation in the population. However, within the context of the present study, it was
more important to compare the effects of conditioning on the observed score as opposed
to the true latent variable, $\theta$, than to defend one population interpretation over another.
And because different definitions of the population odds ratio can result in quite
discrepant values for the odds ratio parameter (Greenland, 1982), the definition given in
Equation 6 was chosen so that any confounding of results which could be attributed to
an inconsistent choice of the population odds ratio (i.e., inconsistent with the MH
statistic) would be eliminated.
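
The population indices of Equations 5 and 6 can be approximated numerically. The sketch below is an illustration under stated assumptions: normal ability densities with B fixed at standard normal, a three-parameter logistic response function without a scaling constant, a midpoint rule on a finite grid, and a mixing proportion of .5. All function names and the item parameter tuples `(a, b, c)` are hypothetical.

```python
import math


def normal_pdf(x, mean=0.0, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def population_dif(item_f, item_b, mean_f, var_f, alpha=0.5,
                   lo=-10.0, hi=10.0, steps=4000):
    """Approximate mu_STD (Equation 5) and psi (Equation 6)."""
    width = (hi - lo) / steps
    mu_std = numerator = denominator = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width           # midpoint of the i-th slice
        g_f = normal_pdf(theta, mean_f, var_f)   # focal ability density
        g_b = normal_pdf(theta)                  # base density, standard normal
        g = alpha * g_f + (1.0 - alpha) * g_b    # combined density
        p_f = p_3pl(theta, *item_f)
        p_b = p_3pl(theta, *item_b)
        mu_std += (p_f - p_b) * g_f * width
        if g > 0.0:
            weight = g_f * g_b / g * width
            numerator += p_b * (1.0 - p_f) * weight
            denominator += p_f * (1.0 - p_b) * weight
    return mu_std, numerator / denominator


# An item with identical parameters in both groups has no population DIF.
mu_null, psi_null = population_dif((1.0, 0.0, 0.2), (1.0, 0.0, 0.2), -1.0, 1.0)
```

When the item parameters are the same in both groups, the sketch returns zero for the first index and 1.0 for the second regardless of where the focal distribution sits, which is the sense in which these indices reflect matching on true ability.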

Prior Ability Distributions 

Previously, it was stated that if examinees from different populations have been
matched on observed test scores, they might not be matched on latent abilities. This
occurs whenever the conditional distributions of true score given observed score are
different for the two groups.

Zwick (1990) showed that if the test reliabilities for both groups were less than
1.0, and if the means of the ability or true-score distributions for each group were not
equal so that the ability distributions were incongruent, then the conditional distributions
of true score given observed score would not be identical but would result in conditional
distributions that were described as being stochastically ordered. Under certain
circumstances, this could produce results that would lead to the MH DIF statistic
erroneously favoring the group with the higher ability. Regardless of the order of the
conditional error distributions of observed score given ability, or $f(X|\theta)$, if such
distributions exist for both groups, then different distributions of ability, $g(\theta)$, will yield
different conditional distributions of ability given observed score, due to Bayes'
theorem.

Degree of Distributional Incongruence 

A measure of the degree to which the two distributions of $\theta$, $g_F(\theta)$ and $g_B(\theta)$,
are incongruent is the percentage of overlap of the areas under the density functions.
This measure allows for an infinite number of combinations of distributions to be
mapped to a simple scalar between 0.0 (signifying no overlap or total incongruence) and
1.0 (complete overlap, or total congruence), and is defined by

$$\mathrm{OVERLAP} = \int_{-\infty}^{\infty} \min[g_B(\theta), g_F(\theta)]\, d\theta . \tag{7}$$
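
Equation 7 is straightforward to approximate numerically. The following sketch assumes normal densities, a finite integration grid, and a midpoint rule; the function names and grid settings are choices made for this example only.

```python
import math


def normal_pdf(x, mean=0.0, var=1.0):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def overlap(mean_f, var_f, mean_b=0.0, var_b=1.0, lo=-12.0, hi=12.0, steps=24000):
    """Area under the pointwise minimum of two normal densities (Equation 7)."""
    width = (hi - lo) / steps
    area = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        area += min(normal_pdf(theta, mean_f, var_f),
                    normal_pdf(theta, mean_b, var_b)) * width
    return area


# Equal-variance case: the focal mean lies 1.75 standard deviations below B.
equal_variance_overlap = overlap(-1.75, 1.0)
```

For two unit-variance normal densities whose means differ by d, the overlap has the closed form 2Φ(-|d|/2), which offers a convenient check on the numerical result.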



Method 

The present study utilized computer simulation methods in order to manipulate 
the primary condition of interest, the degree of incongruence or overlap between the 
distributions of ability of two populations of examinees (B and F). In order to make the 
results more generalizable to real testing situations, item responses taken from previous 
administrations of a 40-item ACT Assessment Mathematics Usage Test were fit using a 
three-parameter logistic model that assumed a unidimensional examinee trait or ability. 
Two comparison samples of 2000 Caucasian and 2000 African-American examinees were 
used to obtain separate B and F group item parameter estimates. Marginal maximum 
likelihood procedures, which assumed standard normal prior ability distributions, were
used on each of the two samples via the computer program, PC-BILOG 3 (Mislevy &
Bock, 1989).

Because the groups were thought to be nonequivalent, the item parameters from
the F group ($a_F$, $b_F$, and $c_F$) were rescaled ($a'_F$, $b'_F$, and $c'_F$) to the B group parameters
using the family of linear transformations,

$$a'_F = a_F / A , \qquad b'_F = A\, b_F + B , \qquad c'_F = c_F ,$$

where

$$A = \frac{SD(\theta_B)}{SD(\theta_F)} , \quad \text{and} \quad B = M(\theta_B) - A\, M(\theta_F) .$$

"Real" DIF between the two groups on any of the 40 items was thus somewhat
reflected in the item parameter estimates.¹ As far as goodness-of-fit was concerned, no
statistical procedure was used to assess the degree of model fit or misfit. Prior
experience has shown that the unidimensional three-parameter logistic model fits these
types of mathematics items on samples of 2000 at least well enough to yield item
parameters that can subsequently produce observed score distributions that are very
close to those obtained from national administrations of the tests. Therefore, these
parameter estimates were used as known item parameters in all of the subsequent
computer simulations.

The B ability distribution, $g_B(\theta)$, was always assumed to be standard normal.
Therefore, only $g_F(\theta)$ varied throughout the simulations, and the measure of
incongruence between the two ability distributions was the proportion of their overlap
(area). The F ability distribution, $g_F(\theta)$, was normally distributed with variance fixed at
either 1.0 or .5. The focal group mean was varied such that $\mu_F(\theta) \le \mu_B(\theta)$.

The known item parameters were used to describe the success probabilities,
$P_F(\theta)$ and $P_B(\theta)$, with the item response function given by Equation 4. Once $g_F(\theta)$ and
$g_B(\theta)$ were specified and a value of $\theta$ from either F or B had been sampled, 40 item
responses were generated in the usual way by comparing either $P_F(\theta)$ or $P_B(\theta)$ to a
pseudorandomly generated uniform deviate between 0 and 1. Statistics were then
computed as functions of the item responses from Equations 1 and 2, and these
values were compared to $\mu_{STD}$ and $\psi$ from Equations 5 and 6, respectively. Actually, the
negative of the natural log of Equation 2 and Equation 6 was computed. Sampling
variability was achieved by replicating each simulation 100 times and by drawing samples
of 500 values each of $\theta$ from $g_F(\theta)$ and $g_B(\theta)$.
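
The response-generation step described above can be sketched as follows. This is a minimal illustration, not the original simulation program: the item parameters are invented for the example (the actual estimates are not reported in the paper), the response function omits any scaling constant, and the function names are hypothetical.

```python
import math
import random


def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(items, mean, var, n_examinees, rng):
    """Draw abilities from a normal density and generate 0/1 item responses.

    items: list of (a, b, c) tuples, treated as known item parameters.
    rng:   a random.Random instance, so that replications are reproducible.
    """
    sd = math.sqrt(var)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(mean, sd)
        # A response is correct when a uniform deviate falls below P(theta).
        row = [1 if rng.random() < p_3pl(theta, a, b, c) else 0
               for a, b, c in items]
        data.append(row)
    return data


rng = random.Random(0)
# Hypothetical 40-item test with difficulties spread from -2.0 to 1.9.
items = [(1.0, -2.0 + 0.1 * i, 0.2) for i in range(40)]
base_sample = simulate_responses(items, 0.0, 1.0, 500, rng)
focal_sample = simulate_responses(items, -2.0, 0.5, 500, rng)
```

Summing each row gives the observed score used as the matching variable; repeating the two calls 100 times would mirror the replication scheme described in the text.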

Two methods were used to assess the fidelity of either DIF statistic in the
identification of an item's true DIF status. One was to compute the bias, standard error
and root mean square error relative to the population DIF value over replications.
These values also could be averaged over the 40 test items to obtain single measures of
estimation accuracy. The second method was to arbitrarily establish a DIF criterion
value for each true DIF index and then to observe the proportion of true positive and
true negative DIF identifications or "hit rates" over replications and items. The DIF
criteria used were $|\mu_{STD}| > .10$ and $|-\ln(\psi)| > \ln(2)$.
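
The two accuracy measures can be illustrated with a short sketch. Two assumptions are made here and are not confirmed by the text: the standard error is computed with divisor n, so that RMSE squared equals bias squared plus SE squared, and the same criterion value is applied to the estimate as to the population index when scoring a hit. The function names are hypothetical.

```python
import math


def accuracy_summary(estimates, true_value):
    """Bias, standard error, and RMSE of replicated estimates for one item."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n)
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, se, rmse


def hit_rate(estimates, true_value, criterion):
    """Proportion of replications whose DIF decision matches the true status."""
    truly_dif = abs(true_value) > criterion
    hits = sum(1 for e in estimates if (abs(e) > criterion) == truly_dif)
    return hits / len(estimates)


# Hypothetical STD estimates for an item whose population value is .03.
rate = hit_rate([0.05, 0.12, -0.02, 0.08], 0.03, 0.10)
```

With the criterion of .10 for STD, the item in this example has no true DIF, so every replication that stays within the criterion counts as a true negative.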

Regardless of the degree of incongruence, the simulated test overall was free from
DIF, as measured by $\frac{1}{40}\sum \mu_{STD}$ and $\frac{1}{40}\sum [-\ln(\psi)]$, the averages over
the 40 items. These average population DIF
values varied only from -.0116 to .0057 and from -.0409 to .0429, respectively, and
indicated, on average, a test free from DIF.



Results 

The results of the computer simulations are summarized in a series of plots given 
by Figures 1-3. Figures 1 and 2 show the results of the MH and STD estimators, 
respectively, in terms of average bias or $[(-\ln \mathrm{MH}) - (-\ln \psi)]$ (across items), average
standard error (SE of estimate across items) and average root mean squared error
(RMSE across items), each as a function of distributional overlap. In each of these
figures, the solid lines represent those situations where the variances of F and B ability
distributions were equal to 1.0; the dotted lines indicate those situations in which the 
variance of the ability distribution of F was .5 while the variance of the ability 
distribution of B remained at 1.0. 

Insert Figures 1 and 2 About Here 



Figure 3 shows the proportion of times (out of 100 replications) that MH and 
STD accurately identified items as either having no DIF or as having DIF, as measured 
by the DIF criteria given above. Because the simulated tests were, on average, free from 
DIF, these "hit rates" tended to be situations involving true negatives (i.e., items without
DIF). Once again, the solid and dotted lines represented the two different variance
conditions.

Insert Figure 3 Here



MH Results 

It was anticipated that the bias in MH would become increasingly negative as
overlap decreased, due to the effect of ordered distributions on MH, as discussed by
Zwick (1990) (i.e., as $g_F(\theta)$ became less than $g_B(\theta)$ in terms of stochastic ordering,
$-\ln \mathrm{MH}$ should have favored B in terms of the DIF analysis). Figure 1 shows that this
did not occur. The MH bias remained slightly positive but basically close to zero until
the percentage of overlap fell to values near .1. The obvious explanation for the
apparently unbiased behavior of MH even as overlap approached zero was the
presence of empty cells or zero frequencies for many of the score categories. These zero
contributions to the overall estimate of the log of the common odds-ratio did not affect
the no-DIF conclusion. And because this was the true situation, the MH estimates
appeared to be unbiased.

The instability of the MH estimator as the percentage of overlap decreased was 
also apparent from the increase in the SE. See Figure 1. Overall, the RMSE remained 
fairly constant until the percentage of distributional overlap was less than .4. This value 
of .4 represented mean differences of -1.75 in the equal variance case and -1.5 in the 
unequal variance case.

The correct identification of DIF and no-DIF items remained fairly high, above
.90, for MH until overlap reached approximately .3. The MH Hit Rate fell off sharply 
after that. See Figure 3. 
STD Results 

Similar findings were noted for the STD estimator. Assuming that the stochastic 
ordering of the two distributions would once again produce results which (falsely) 
favored B, it was again anticipated that the bias in STD would become increasingly
negative as overlap decreased. Figure 2 shows that the STD estimator remained fairly 
unbiased once again, even when the overlap percentage approached zero. The SE again 
increased as the two distributions separated, which resulted in an increase in the RMSE. 
These results were consistent across both variance conditions. 

Correct DIF identification with STD was consistently lower than that of MH until 
the percentage of distributional overlap reached .2. After that, the situation was 
reversed with STD performing better than MH. See Figure 3. 
Asymptotic Bias 

In order to determine the effect of sample sizes on these results, it was possible to
evaluate MH and STD as the number of items, k, remained fixed and the sample sizes
within the cells of the k + 1 2×2 tables used to obtain MH and STD increased
indefinitely. This was done analytically, using a recursive procedure to obtain $f(X|\theta)$ and
hence, $h(X)$, for each group, as described in Lord and Wingersky (1984, p. 454). This
evaluation did two things. First, because the sample sizes were infinite, SE was driven to
zero. And because the cells contained expected frequencies, zero cell frequencies were
eliminated. These analytical values of MH and STD were then compared to $\mu_{STD}$ and
$-\ln \psi$. The difference between the analytical and the population value was termed
asymptotic bias. Figure 4 shows the average asymptotic bias (over items) of MH and
STD as functions of overlap. In this figure, the anticipated direction of the bias was
confirmed. Both MH and STD were biased in the direction of B (i.e., negatively). Note
that the severity of the bias for MH and STD was about the same. The appearance of
differences between MH and STD in Figure 4 was due to differences in the scales of the
two estimators.
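
The recursive procedure cited above builds the conditional distribution of the number-correct score one item at a time. The sketch below shows that recursion for a single value of θ; it is an illustration of the general technique rather than the code used in the study, and the function name is hypothetical. The marginal distribution $h(X)$ for a group would then follow by averaging these conditional distributions over that group's ability density.

```python
def score_distribution(probs):
    """Distribution of the number-correct score given per-item success
    probabilities at one fixed ability level (entries for X = 0, ..., k)."""
    dist = [1.0]                        # with no items, X = 0 with certainty
    for p in probs:
        updated = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            updated[x] += mass * (1.0 - p)    # item answered incorrectly
            updated[x + 1] += mass * p        # item answered correctly
        dist = updated
    return dist


# Two items, each answered correctly with probability .5 at this ability.
two_item_distribution = score_distribution([0.5, 0.5])
```

Because the result is a full probability distribution over the k + 1 score categories, every category receives a positive expected frequency, which is why zero cell frequencies disappear in the asymptotic evaluation.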

Insert Figure 4 Here 



Figures 1, 2, and 4 illustrate an interesting paradox in using the MH and STD
estimators when the two ability distributions were non-overlapping. The statistics
remained fairly unbiased for tests with no DIF when the sample sizes were moderate, due
to the many zero cell frequency contributions to MH and STD. However, for these
moderate sample sizes, the SE was fairly substantial. The end result was that the RMSE
increased as overlap decreased. Increasing the sample sizes would certainly decrease the
SE of the MH and STD estimates but coincidentally it would increase the bias. The net
result would be the same, namely that the RMSE would increase as the percentage of
overlap decreased.

Results for Completely Congruent Cases

Three other simulations were conducted to show the effects of reduced test 
reliability alone, as opposed to distributional incongruence, on DIF identification. These 
simulations were conducted so that the variances of both latent ability distributions were 
1.0, but $\mu_{F\theta}$ and $\mu_{B\theta}$ were both set at -1.0, -2.0, and -3.0. The results were then
compared to the original case of complete congruence, with $\mu_{F\theta}$ and $\mu_{B\theta}$ equal to 0.0, as
well as the non-overlapping cases illustrated previously in Figures 1-4. In this way the 
effects due to distributional incongruence could be somewhat separated from those due 
only to reduced reliability. These results are summarized in Table 1. 

Insert Table 1 Here 



As Table 1 illustrates, most of the increases in SE (and, consequently, in RMSE)
and the decreases in Hit Rates seen in Figures 1-4 were due to distributional
incongruence rather than to lowered test reliabilities alone. As long as the two
distributions remained congruent, SE and Hit Rates were fairly consistent. And although
there was some decline in DIF identification performance as reliability decreased, it was
not as severe as that observed when overlap was less than 1.0. However, it should be
pointed out that reduced test reliability and distributional incongruence, as modeled in
these computer simulations, were somewhat confounded. Obviously, it was impossible to
shift $g_F(\theta)$ too far in the negative direction without affecting the test reliability of F due
to the nonzero lower asymptote imposed by the three-parameter logistic function.
Although distributional incongruence imposed a reduced reliability condition on F, it was
necessary to tolerate this confounding unless the item parameters were modified across
each simulation condition, which was an unappealing alternative.

Table 1 also shows that the average asymptotic bias remained relatively 
unchanged as long as the two distributions were congruent. The average bias for 
samples of 500 was fairly close to the asymptotic results, even when test reliability was 
reduced. 



Discussion

Although the results of these simulations were obtained using item response
models estimated from a specific test and abilities generated from specific distributions,
it is believed that these results are generalizable to a broader class of testing situations
because of the wide range of distributional incongruence studied and because the test
that was used to generate the responses was typical of many achievement tests. The
major conclusion drawn from this study was that the use of the observed score, X, as a
latent ability surrogate in computing MH and STD appeared to be acceptable, even
when the degree of distributional incongruence was fairly substantial. DIF identification
by MH and STD was acceptable for latent ability distributions that were as much as 1.5
to 2.0 standard deviations apart.

These results would appear to hold for tests which contain few DIF items. A
similar study should be conducted to investigate the effect of the severity of distributional
incongruence on tests where the occurrence of DIF is more frequent.

Footnote

¹Item parameter estimates are not included in this paper but will be provided upon
request.

Table 1

Figure Captions

Figure 1. Bias, SE, and RMSE as a function of overlap for MH
Figure 2. Bias, SE, and RMSE as a function of overlap for STD
Figure 3. Hit rates as a function of overlap
Figure 4. Asymptotic bias as a function of overlap